Loading [MathJax]/extensions/MathMenu.js

A first taste of Bayes

Let’s try some Bayesian data analysis

Coin flips with prop_model

The function prop_model has been loaded into your workspace. It implements a Bayesian model that assumes that:

Assume you just flipped a coin four times and the result was heads, tails, tails, heads. If you code heads as a success and tails as a failure then the following R codes runs prop_model with this data

data <- c(1, 0, 0, 1)
prop_model(data)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggridges)

prop_model <- function (data = c(), prior_prop = c(1, 1), n_draws = 10000, 
    show_plot = TRUE) 
{
    data <- as.logical(data)
    proportion_success <- c(0, seq(0, 1, length.out = 100), 1)
    data_indices <- round(seq(0, length(data), length.out = min(length(data) + 
        1, 20)))
    post_curves <- map_dfr(data_indices, function(i) {
        value <- ifelse(i == 0, "Prior", ifelse(data[i], "Success", 
            "Failure"))
        label <- paste0("n=", i)
        probability <- dbeta(proportion_success, prior_prop[1] + 
            sum(data[seq_len(i)]), prior_prop[2] + sum(!data[seq_len(i)]))
        probability <- probability/max(probability)
        tibble(value, label, proportion_success, probability)
    })
    post_curves$label <- fct_rev(factor(post_curves$label, levels = paste0("n=", 
        data_indices)))
    post_curves$value <- factor(post_curves$value, levels = c("Prior", 
        "Success", "Failure"))
    p <- ggplot(post_curves, aes(x = proportion_success, y = label, 
        height = probability, fill = value)) + geom_density_ridges(stat = "identity", 
        color = "white", alpha = 0.8, panel_scaling = TRUE, size = 1) + 
        scale_y_discrete("", expand = c(0.01, 0)) + scale_x_continuous("Underlying proportion of success") + 
        scale_fill_manual(values = hcl(120 * 2:0 + 15, 100, 65), 
            name = "", drop = FALSE, labels = c("Prior   ", "Success   ", 
                "Failure   ")) + theme_light(base_size = 18) + 
        theme(legend.position = "top")
    if (show_plot) {
        print(p)
    }
    invisible(rbeta(n_draws, prior_prop[1] + sum(data), prior_prop[2] + 
        sum(!data)))
}
data <- c(1, 0, 0, 1)
prop_model(data)

The model knows it’s not close to 0% or close to 100%, but believes it could be anything in between.

Zombie drugs with prop_model

If we really were interested in the underlying proportion of heads of this coin then prop_model isn’t particularly useful. Since it assumes that any underlying proportion of success is equally likely prior to seeing any data it will take a lot of coin flipping to convince prop_model that the coin is fair. This model is more appropriate in a situation where we have little background knowledge about the underlying proportion of success.

# Update the data and rerun prop_model
data = c(rep.int(0, 11), 1, 1)
prop_model(data)

It seems like it’s not a perfect drug, but between 5% to 40% cured zombies is better than nothing!

Samples and posterior summaries

Looking at samples from prop_model

Here again is the prop_model function which has been given the data from our zombie experiment where two out of 13 zombies got cured. In addition to producing a plot, prop_model also returns a large random sample from the posterior over the underlying proportion of success.

data = c(1, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0)
         
# Extract and explore the posterior
posterior <- prop_model(data)

head(posterior)
## [1] 0.1272290 0.2716066 0.1393354 0.2466503 0.1782281 0.1298544
# Plot the histogram of the posterior
hist(posterior, breaks = 30, xlim = c(0, 1), col = "palegreen4")

Hey, that’s a good looking plot!

Summarizing the zombie drug experiment

The point of working with samples from a probability distribution is that it makes it easy to calculate new measures of interest. The following tasks are about doing just this!

A point estimate is a single number used to summarize what’s known about a parameter of interest. It can be seen as a “best guess” of the value of the parameter. A commonly used point estimate is the median of the posterior. It’s the midpoint of the distribution, and it’s equally probable for the parameter value to be larger than the median as it is to be smaller than it.

# Calculate the median
median(posterior)
## [1] 0.1875193

So, a best guess is that the drug would cure around 18% of all zombies. Another common summary is to report an interval that includes the parameter of interest with a certain probability. This is called a credible interval (CI). With a posterior represented as a vector of samples you can calculate a CI using the quantile() function.

# Calculate the credible interval
quantile(posterior, c(0.05, 0.95))
##         5%        95% 
## 0.06091095 0.38548720

According to the credible interval, there is a 90% probability that the proportion of zombies the drug would cure is between 6% and 38%. (Here we have to be careful to remember that the percentage of cured zombies and the percentage of probability are two different things.)

Now, there is a rival zombie laboratory that is also working on a drug. They claim that they are certain that their drug cures 7% of the zombies it’s administered to. Can we calculate how probable it is that our drug is better? Yes, we can! But it’s a two stage process.

# Calculate the sum
sum(posterior > 0.07)
## [1] 9303
# Calculate the probability
sum(posterior > 0.07)/length(posterior)
## [1] 0.9303

It seems there is a large probability (93%) that our zombie drug is better!

You’ve done some Bayesian data analysis!

The parts needed for Bayesian inference

Take a generative model for a spin

Below you have the R code that implements the generative model we just developed.

# The generative zombie drug model

# Set parameters
prop_success <- 0.42
n_zombies <- 100

# Simulating data
data <- c()
for(zombie in 1:n_zombies) {
  data[zombie] <- runif(1, min = 0, max = 1) < prop_success
}

data <- as.numeric(data)
data
##   [1] 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 1 1 0 0 0
##  [38] 0 0 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 0 1 0 0 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 1
##  [75] 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 1 1 0
# Count cured
data <- sum(data)
data
## [1] 51

Perfect! Some zombies got cured in this simulation, but far from all.

Take the binomial distribution for a spin

It turns out that the generative model you ran last exercise already has a name. It’s called the binomial process or the binomial distribution. In R you can use the rbinom function to simulate data from a binomial distribution. The rbinom function takes three arguments:

# Try out rbinom
rbinom(n = 1, size = 100, prob = 0.42)
## [1] 47
# Change the parameters
rbinom(n = 200, size = 100, prob = 0.42)
##   [1] 46 47 47 37 42 37 42 42 43 39 44 39 40 48 42 50 35 44 46 41 42 32 49 47 51
##  [26] 36 38 46 44 41 34 42 39 29 42 45 40 36 48 47 43 39 44 45 40 41 32 41 47 52
##  [51] 33 36 43 50 34 48 36 38 50 43 40 42 43 48 39 40 41 48 43 48 47 41 43 41 39
##  [76] 43 31 52 50 41 39 49 46 45 45 36 44 46 41 37 43 35 46 42 44 45 45 40 34 36
## [101] 43 43 42 37 38 40 39 37 47 35 43 44 44 41 47 33 41 46 38 45 32 41 43 47 43
## [126] 41 39 38 43 44 42 39 41 39 40 42 43 42 39 44 42 45 37 41 49 41 41 38 46 49
## [151] 35 49 44 40 39 36 45 47 45 43 50 43 53 42 43 30 41 50 44 43 45 36 37 38 42
## [176] 46 35 38 41 43 43 44 39 44 37 39 31 49 45 37 43 37 43 52 49 29 36 40 37 38

Nice! That’s a lot of simulated zombies right there.

Using a generative model

How many visitors could your site get (1)?

To get more visitors to your website you are considering paying for an ad to be shown 100 times on a popular social media site. According to the social media site, their ads get clicked on 10% of the time.

# Fill in the parameters
n_samples <- 100000
n_ads_shown <- 100
proportion_clicks <- 0.10
n_visitors <- rbinom(n_samples, size = n_ads_shown, 
                     prob = proportion_clicks)

# Visualize n_visitors
hist(n_visitors, breaks = 25, col = "palegreen4")

Representing uncertainty with priors

Adding a prior to the model

You’re not so sure that your ad will get clicked on exactly 10% of the time. Instead of assigning proportion_clicks a single value you are now going to assign it a large number of values drawn from a probability distribution.

For now, we are going to assume that it’s equally likely that proportion_clicks could be as low as 0% or as high as 20%.

n_samples <- 100000
n_ads_shown <- 100

# Update proportion_clicks
proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2)

n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = proportion_clicks)

Because the rbinom function is vectorized the first value of proportion_clicks is used to sample the first value in n_visitors, the second value in proportion_clicks is used for the second in n_visitors, and so on. The result is that the samples in n_visitors now also incorporate the uncertainty in what the underlying proportion of clicks could be.

# Visualize proportion clicks
hist(proportion_clicks, col = "blue")

You shouldn’t be surprised to see that the uncertainty over proportion_clicks is just as you specified it to be: Uniform between 0.0 and 0.2 (except for some small variations in the height of the bars because we took a random sample using runif).

# Visualize n_visitors
hist(n_visitors, col = "palegreen4")

Uncertainty changed the distribution. We went from a random distribution of about 90% to, with added uncertainty of 0 to 20%, 70% likelihood that someone will click on an ad.

Bayesian models and conditioning

Update a Bayesian model with data

You ran your ad campaign, and 13 people clicked and visited your site when the ad was shown a 100 times. You would now like to use this new information to update the Bayesian model.

# Create the prior data frame
prior <- data.frame(proportion_clicks, n_visitors)

# Examine the prior data frame
head(prior)
##   proportion_clicks n_visitors
## 1        0.04581983          4
## 2        0.17995833         17
## 3        0.13082784         14
## 4        0.17001025         11
## 5        0.07207851          8
## 6        0.10097822         12

The reason we’ve called it prior is because it represents the uncertainty before (that is, prior to) having included the information in the data. Let’s do that now!

prior$n_visitors represented the uncertainty over how many visitors you would get because of the ad campaign. But now you know you got exactly 13 visitors.

# Create the posterior data frame
posterior <- prior[prior$n_visitors == 13, ]

# Visualize posterior proportion clicks
hist(posterior$proportion_clicks, col = "palegreen4")

How many visitors could your site get (3)?

In the last exercise, you updated the probability distribution over the underlying proportions of clicks (proportion_clicks) using new data. Now we want to use this updated proportion_clicks to predict how many visitors we would get if we reran the ad campaign.

The result from the last exercise is still in the data frame posterior, but if you look at posterior$n_visits you’ll see it’s just 13 repeated over and over again. This makes sense as posterior represents what the model knew about the outcome of the last ad campaign after having seen the data.

# Assign posterior to a new variable called prior
prior <- posterior

# Take a look at the first rows in prior
head(prior)
##     proportion_clicks n_visitors
## 9          0.08682619         13
## 20         0.15113032         13
## 107        0.11840011         13
## 114        0.19432296         13
## 117        0.09465449         13
## 120        0.10919364         13
# Replace prior$n_visitors with a new sample and visualize the result
n_samples <-  nrow(prior)
n_ads_shown <- 100
prior$n_visitors <- rbinom(n_samples, size = n_ads_shown,
                           prob = prior$proportion_clicks)
hist(prior$n_visitors, col = "palegreen4")

# Calculate the probability that you will get 5 or more visitors
sum(prior$n_visitors >= 5)/length(prior$n_visitors)
## [1] 0.9873123

It’s pretty probable that you’ll get more than 5 visitors.

What have we done?

Joint Probability Distribution

Chapter 4: Computational Methods

Four good things with Bayes

Explore using the Beta distribution as a prior

The Beta distribution is a useful probability distribution when you want model uncertainty over a parameter bounded between 0 and 1. Here you’ll explore how the two parameters of the Beta distribution determine its shape.

# Explore using the rbeta function
beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1)

# Visualize the results
hist(beta_sample, col = "blue")

Right! A Beta(1,1) distribution is the same as a uniform distribution between 0 and 1. It is useful as a so-called non-informative prior as it expresses than any value from 0 to 1 is equally likely.

# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = -1, shape2 = 1)
## Warning in rbeta(n = 1e+06, shape1 = -1, shape2 = 1): NAs produced
# Explore the results
head(beta_sample)
## [1] NaN NaN NaN NaN NaN NaN

Yes, NaN stands for not a number and the reason you got a lot of NaNs is that the Beta distribution is only defined when its shape parameters are positive.

# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 100)

# Visualize the results
hist(beta_sample, col = "blue")

So the larger the shape parameters are, the more concentrated the beta distribution becomes. When used as a prior, this Beta distribution encodes the information that the parameter is most likely close to 0.5.

# Modify the parameters
beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 20)

# Visualize the results
hist(beta_sample, col = "blue")

So the larger the shape1 parameter is the closer the resulting distribution is to 1.0 and the larger the shape2 the closer it is to 0.

Change the model to use an informative prior

The code is the old model you developed from scratch in chapter 2.

n_draws <- 100000
n_ads_shown <- 100

# Change the prior on proportion_clicks
proportion_clicks <- 
  rbeta(n_draws, shape1 = 5, shape2 = 95)
n_visitors <- 
  rbinom(n_draws, size = n_ads_shown, 
         prob = proportion_clicks)
prior <- 
  data.frame(proportion_clicks, n_visitors)
posterior <- 
  prior[prior$n_visitors == 13, ]

# This plots the prior and the posterior in the same plot
par(mfcol = c(2, 1))
hist(prior$proportion_clicks, 
     xlim = c(0, 0.25))
hist(posterior$proportion_clicks, 
     xlim = c(0, 0.25))

Take a look at the new posterior! Due to the new informative prior it has shifted to the left, favoring lower rates.

Contrasts and comparisons

Fit the model using another dataset

Let’s fit the binomial model to both the video ad data (13 out of 100 clicked) and the new text ad data (6 out of a 100 clicked).

# Define parameters
n_draws <- 100000
n_ads_shown <- 100
proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
n_visitors <- rbinom(n = n_draws, size = n_ads_shown, 
                     prob = proportion_clicks)
prior <- data.frame(proportion_clicks, n_visitors)

# Create the posteriors for video and text ads
posterior_video <- prior[prior$n_visitors == 13, ]
posterior_text <- prior[prior$n_visitors == 6, ]

# Visualize the posteriors
hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))

hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))

Looking at the histogram of posterior_text it can be said that the value of the proportion of clicks for the text ad is likely between 0.03 and 0.13.

Calculating the posterior difference

The posterior proportion_clicks for the video and text ad has been put into a single posterior data frame. The reason for [1:4000] is because these proportion_clickss are not necessarily of the same length, which they need to be when put into a data frame.

Now it’s time to calculate the posterior probability distribution over what the difference in proportion of clicks might be between the video ad and the text ad.

posterior <- data.frame(
    video_prop = posterior_video$proportion_clicks[1:4000],
    text_prop  = posterior_text$proportion_click[1:4000])
    
# Calculate the posterior difference: video_prop - text_prop
posterior$prop_diff <- posterior$video_prop - posterior$text_prop

# Visualize prop_diff
hist(posterior$prop_diff, col = "palegreen4")

# Calculate the median of prop_diff
median(posterior$prop_diff)
## [1] 0.06637898
# Calculate the proportion
mean(posterior$prop_diff > 0)
## [1] 0.9435

Given the model and the data, the probability that the video ad is better than the text ad is 95%. It’s pretty likely that the video ad is better.

Decision analysis

A small decision analysis 1

Each visitor spends $2.53 on average, a video ad costs $0.25 and a text ad costs $0.05. Let’s figure out the probable profit when using video ads and text ads!

visitor_spend <- 2.53
video_cost <- 0.25
text_cost <- 0.05

# Add the column posterior$video_profit
posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost

# Add the column posterior$text_profit
posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost

# Visualize the video_profit and text_profit columns
par(mfcol = c(2,1))
hist(posterior$video_profit, xlim = c(-0.2, 0.4))
hist(posterior$text_profit, xlim = c(-0.2, 0.4))

Great! Take a look at the two histograms you’ve plotted, which method seems most profitable, if any?

A small decision analysis 2

Using the columns video_profit and text_profit that you added to posterior in the last exercise, let’s conclude the decision analysis.

# Add the column posterior$profit_diff
posterior$profit_diff <- posterior$video_profit - posterior$text_profit

# Visualize posterior$profit_diff
hist(posterior$profit_diff)

# Calculate a "best guess" for the difference in profits
median(posterior$profit_diff)
## [1] -0.03206119
# Calculate the probability that text ads are better than video ads
mean(posterior$profit_diff < 0)
## [1] 0.62

So it seems that the evidence does not strongly favor neither text nor video ads. But if forced to choose at this point, we would choose text ads.

Even though text ads get a lower proportion of clicks, they are also much cheaper. And, as you have calculated, there is a 63% probability that text ads are better.

Change anything and everything

The Poisson distribution

The Poisson distribution simulates a process where the outcome is a number of occurrences per day/year/area/unit/etc. Before using it in a Bayesian model, let’s explore it!

# Simulate from a Poisson distribution and visualize the result
x <- rpois(n = 10000, lambda = 3)
hist(x)

Let’s say that you run an ice cream stand and on cloudy days you on average sell 11.5 ice creams. It’s a cloudy day.

x <- rpois(n = 10000, lambda = 11.5)
hist(x)

It’s still a cloudy day, and unfortunately, you won’t break even unless you sell 15 or more ice creams.

mean(x >= 15)
## [1] 0.1879

It is not likely that you will break even on a cloudy day, so we should stay at home.

Clicks per day instead of clicks per ad

When you put up a banner on your friend’s site you got 19 clicks in a day, how many daily clicks should you expect this banner to generate on average? Now, modify your model, one piece at a time, to calculate this.

# Change the model according to instructions
n_draws <- 100000
mean_clicks <- runif(n_draws, min = 0, max = 80)
n_visitors <- rpois(n = n_draws, mean_clicks)

prior <- data.frame(mean_clicks, n_visitors)
posterior <- prior[prior$n_visitors == 19, ]

# Visualize mean_clicks
par(mfcol = c(2,1))
hist(prior$mean_clicks)
hist(posterior$mean_clicks)

The model is complete! Like before you could now calculate credible probability intervals using quantile or calculate the probability of getting more than, say, 15 clicks next day. But just looking at the posterior probability distribution:

The range of expected daily clicks is 12 to 28 daily clicks on average.

Bayes is optimal, kind of…

Probability rules

Cards and the sum rule

A standard French-suited deck of playing cards contains 52 cards; 13 each of hearts (♥), spades (♠), clubs (♣), and diamonds (♦). Assuming that you have a well-shuffled deck in front of you, the probability of drawing any given card is 1/52 ≈ 1.92%.

# Calculate the probability of drawing any of the four aces
prob_to_draw_ace <- 1/52 * 4

Cards and the product rule

Again, assuming that you have a well-shuffled deck in front of you, the probability of drawing any given card is 1/52 ≈ 1.92% . The probability of drawing any of the four aces is 1/52 + 1/52 + 1/52 + 1/52 = 4/52. Once an ace has been drawn, the probability of picking any of the remaining three is 3/51. If another ace is drawn the probability of picking any of the remaining two is 2/50, and so on.

# Calculate the probability of picking four aces in a row
prob_to_draw_four_aces <- 4/52 * 3/51 * 2/50 * 1/49

Yes! The probability to draw four aces is 4/52 * 3/51 * 2/50 * 1/49 = 0.0004%. Pretty unlikely!

Calculating likelihoods

From rbinom to dbinom

Below is currently code that

  1. Simulates the number of clicks/visitors (n_clicks) from 100 shown ads using the rbinom function given that the underlying proportion of clicks is 10%.
  2. Calculates the probability of getting 13 visitors (prob_13_visitors).

That is, in probability notation it’s calculating P(n_visitors = 13 | proportion_clicks = 10%).

# Rewrite this code so that it uses dbinom instead of rbinom
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- rbinom(n = 99999, 
    size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors)
prob_13_visitors
## [1] 0.07390074
n_ads_shown <- 100
proportion_clicks <- 0.1
prob_13_visitors <- dbinom(13, 
    size = n_ads_shown, prob = proportion_clicks)
prob_13_visitors
## [1] 0.07430209

Alright! You got a similar result as with rbinom, but dbinom is much more efficient!

Calculating probabilities with dbinom

Below is roughly the code you ended up with in the last exercise.

# Change the code to calculate probability distribution of n_visitors
n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- 13
prob <- dbinom(n_visitors, 
    size = n_ads_shown, prob = proportion_clicks)
prob
## [1] 0.07430209

Currently the code calculates P(n_visitors = 13 | proportion_clicks = 10%): The probability of getting 13 visitors given that the proportion of clicks is 10%.

Change the code to instead calculate P(n_visitors | proportion_clicks = 10%): The probability distribution over all possible numbers of visitors.

n_ads_shown <- 100
proportion_clicks <- 0.1
n_visitors <- seq(0, 100, 1)
prob <- dbinom(n_visitors, 
    size = n_ads_shown, prob = proportion_clicks)
prob
##   [1]  2.656140e-05  2.951267e-04  1.623197e-03  5.891602e-03  1.587460e-02
##   [6]  3.386580e-02  5.957873e-02  8.889525e-02  1.148230e-01  1.304163e-01
##  [11]  1.318653e-01  1.198776e-01  9.878801e-02  7.430209e-02  5.130383e-02
##  [16]  3.268244e-02  1.929172e-02  1.059153e-02  5.426525e-03  2.602193e-03
##  [21]  1.170987e-03  4.956559e-04  1.977617e-04  7.451890e-05  2.656461e-05
##  [26]  8.972934e-06  2.875940e-06  8.758007e-07  2.537042e-07  6.998736e-08
##  [31]  1.840408e-08  4.617512e-09  1.106279e-09  2.532895e-10  5.545880e-11
##  [36]  1.161994e-11  2.331161e-12  4.480309e-13  8.253201e-14  1.457830e-14
##  [41]  2.470212e-15  4.016606e-16  6.269305e-17  9.395858e-18  1.352434e-18
##  [46]  1.870032e-19  2.484342e-20  3.171501e-21  3.890962e-22  4.587982e-23
##  [51]  5.199713e-24  5.664175e-25  5.930440e-26  5.967739e-27  5.771270e-28
##  [56]  5.363200e-29  4.788572e-30  4.107157e-31  3.383290e-32  2.676049e-33
##  [61]  2.031815e-34  1.480375e-35  1.034671e-36  6.934301e-38  4.454325e-39
##  [66]  2.741123e-40  1.615140e-41  9.106926e-43  4.910597e-44  2.530420e-45
##  [71]  1.245128e-46  5.845669e-48  2.616117e-49  1.114936e-50  4.520010e-52
##  [76]  1.741041e-53  6.363454e-55  2.203794e-56  7.220406e-58  2.234162e-59
##  [81]  6.516307e-61  1.787738e-62  4.602579e-64  1.109055e-65  2.493907e-67
##  [86]  5.216014e-69  1.010856e-70  1.807405e-72  2.966699e-74  4.444493e-76
##  [91]  6.035732e-78  7.369636e-80  8.010474e-82  7.656367e-84  6.335055e-86
##  [96]  4.445653e-88  2.572716e-90  1.178793e-92  4.009500e-95  9.000000e-98
## [101] 1.000000e-100
# Plot the distribution
plot(n_visitors, prob, type = "h")

It seems that with proportion_clicks = 10% the most probable n_visitors values are between 1 and 20. Now, let’s flip the equation:

# Change the code according to the instructions
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 13
prob <- dbinom(n_visitors, 
    size = n_ads_shown, prob = proportion_clicks)
prob
##   [1]  0.000000e+00  2.965956e-11  1.004526e-07  8.009768e-06  1.368611e-04
##   [6]  1.001075e-03  4.265719e-03  1.247940e-02  2.764481e-02  4.939199e-02
##  [11]  7.430209e-02  9.703719e-02  1.125256e-01  1.178532e-01  1.129620e-01
##  [16]  1.001234e-01  8.274855e-02  6.419966e-02  4.701652e-02  3.265098e-02
##  [21]  2.158348e-02  1.362418e-02  8.234325e-03  4.775927e-03  2.663369e-03
##  [26]  1.430384e-03  7.408254e-04  3.704422e-04  1.790129e-04  8.366678e-05
##  [31]  3.784500e-05  1.657584e-05  7.032793e-06  2.891291e-06  1.151996e-06
##  [36]  4.448866e-07  1.665302e-07  6.041614e-08  2.124059e-08  7.234996e-09
##  [41]  2.386939e-09  7.624614e-10  2.357105e-10  7.048636e-11  2.037726e-11
##  [46]  5.691404e-12  1.534658e-12  3.991862e-13  1.000759e-13  2.415778e-14
##  [51]  5.609229e-15  1.251336e-15  2.678760e-16  5.495443e-17  1.078830e-17
##  [56]  2.023515e-18  3.620178e-19  6.166397e-20  9.980560e-21  1.531703e-21
##  [61]  2.223762e-22  3.046572e-23  3.927965e-24  4.752038e-25  5.377247e-26
##  [66]  5.671478e-27  5.554432e-28  5.030231e-29  4.193404e-30  3.201904e-31
##  [71]  2.227032e-32  1.402449e-33  7.942805e-35  4.015572e-36  1.797200e-37
##  [76]  7.054722e-39  2.403574e-40  7.024314e-42  1.737424e-43  3.582066e-45
##  [81]  6.048981e-47  8.199196e-49  8.713462e-51  7.062754e-53  4.226413e-55
##  [86]  1.795925e-57  5.170371e-60  9.521923e-63  1.044590e-65  6.239308e-69
##  [91]  1.807405e-72  2.180415e-76  8.911963e-81  9.240821e-86  1.591196e-91
##  [96]  2.358848e-98 1.001493e-106 1.546979e-117 8.461578e-133 6.239651e-159
## [101]  0.000000e+00
plot(proportion_clicks, prob, type = "h")

The plot you just produced almost looks like it shows the probability distribution over different values of proportion_clicks, but it does not. For one, the values in prob do not sum up to one. What you have calculated is the likelihood of different values of proportion_clicks to result in n_visitors = 13.

Looking at the plot, the value of proportion_clicks that seems to give the maximum likelihood to produce n_visitors = 13 is proportion_clicks = 0.13

What you’ve found by eyeballing this graph is the so-called maximum likelihood estimate of proportion_clicks.

Bayesian calculation

Calculating a joint distribution

Below, you have parts of the code we developed in the last video. It defines a grid over the underlying proportions of clicks (proportion_clicks) and possible outcomes (n_visitors) in pars. It adds to it the prior probability of each parameter combination and the likelihood that each proportion_clicks would generate the corresponding n_visitors.

n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- seq(0, 100, by = 1)
pars <- expand.grid(proportion_clicks = proportion_clicks,
                    n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, 
    size = n_ads_shown, prob = pars$proportion_clicks)

# Add the column pars$probability and normalize it
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)

Conditioning on the data (again)

Let’s resurrect the zombie site example where you tested text ads. Out of a 100 impressions of the text ad, 6 out of a 100 clicked and visited your site.

Below is roughly the code you developed in the last exercise. pars is currently the joint distribution over all combinations of proportion_clicks and n_visitors.

n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- seq(0, 100, by = 1)
pars <- expand.grid(proportion_clicks = proportion_clicks,
                    n_visitors = n_visitors)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
pars$likelihood <- dbinom(pars$n_visitors, 
    size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
# Condition on the data 
pars <- pars[pars$n_visitors == 6, ]
# Normalize again
pars$probability <- pars$probability / sum(pars$probability)
# Plot the posterior pars$probability
plot(pars$proportion_clicks, pars$probability, type = "h")

Cool! You have now calculated (rather than simulated) your first posterior probability distribution!

A conditional shortcut

Great, you’ve now done some Bayesian computation, without doing any simulation! The plot you produced should be similar to the posterior distribution you calculated in chapter 3. However, you see that it required an awful lot of code, isn’t there anything we can cut?

Yes, there is! You can directly condition on the data, no need to first create the joint distribution.

# Simplify the code below by directly conditioning on the data
n_ads_shown <- 100
proportion_clicks <- seq(0, 1, by = 0.01)
n_visitors <- 6
pars <- expand.grid(proportion_clicks = proportion_clicks)
pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2) 
pars$likelihood <- dbinom(n_visitors, 
    size = n_ads_shown, prob = pars$proportion_clicks)
pars$probability <- pars$likelihood * pars$prior
pars$probability <- pars$probability / sum(pars$probability)
plot(pars$proportion_clicks, pars$probability, type = "h")

Unnecessary lines are removed, the code is easier to read, and the program runs quicker! Nice!

Bayes’ theorem

The temperature in a Normal lake

rnorm, dnorm, and the weight of newborns

Here is a small data set with the birth weights of six newborn babies in grams.

c(3164, 3362, 4435, 3542, 3578, 4529)
# Assign mu and sigma
mu <- 3600
sigma <- 300

weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen")

# Create weight
weight <- seq(0, 6000, 100)

# Calculate likelihood
likelihood <- dnorm(weight, mu, sigma)

# Plot the distribution of weight
plot(weight, likelihood, type = 'h')

A Bayesian model of water temperature

A Bayesian model of Zombie IQ

Zombies are stupid, and you and your colleagues at the National Zombie Research Laboratory are interested in how stupid they are. Below, you have the Normal model we developed in the last video, but with the temperature data switched out with some zombie IQs fresh from the lab. What we’re interested in is how much we can learn about the mean zombie IQ from this data. The model is complete, save for that we need to calculate the probability of each parameter combination in pars.

# The IQ of a bunch of zombies
iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)
# Defining the parameter grid
pars <- expand.grid(mu = seq(0, 150, length.out = 100), 
                    sigma = seq(0.1, 50, length.out = 100))
# Defining and calculating the prior density for each parameter combination
pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
pars$prior <- pars$mu_prior * pars$sigma_prior
# Calculating the likelihood for each parameter combination
for(i in 1:nrow(pars)) {
  likelihoods <- dnorm(iq, pars$mu[i], pars$sigma[i])
  pars$likelihood[i] <- prod(likelihoods)
}
# Calculate the probability of each parameter combination
pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)

Ok! The code for calculating the likelihood was a bit messy, but Bayes theorem stays the same.

Eyeballing the mean IQ of zombies?

In the last exercise, you computed the probability for each mean (mu) and SD (sigma) combination. Using the levelplot function from the lattice package we can now visualize this 2D probability distribution:

library(lattice)
library(RColorBrewer)
coul <- colorRampPalette(brewer.pal(9, "Blues"))(25)[1:25]
levelplot(probability ~ mu * sigma, data = pars, col.regions = coul)

A mean IQ of 42 ± 10

That seems about right! Zombies are not very smart it seems. Who could have guessed…

Answering the question: Should I have a beach party?

Sampling from the zombie posterior

Again pars contains the data frame representing the posterior zombie IQ distribution you calculated earlier. The code to the right draws sample_indices: a sample of row numbers (a.k.a. indices) from the posterior. Now, let’s sample from pars to calculate some new measures!

head(pars)
##         mu sigma    mu_prior sigma_prior        prior likelihood probability
## 1 0.000000   0.1 0.002419707  0.02004008 4.849113e-05          0           0
## 2 1.515152   0.1 0.002456367  0.02004008 4.922578e-05          0           0
## 3 3.030303   0.1 0.002493009  0.02004008 4.996010e-05          0           0
## 4 4.545455   0.1 0.002529617  0.02004008 5.069373e-05          0           0
## 5 6.060606   0.1 0.002566174  0.02004008 5.142633e-05          0           0
## 6 7.575758   0.1 0.002602661  0.02004008 5.215754e-05          0           0
sample_indices <- sample( nrow(pars), size = 10000,
    replace = TRUE, prob = pars$probability)
head(sample_indices)
## [1] 3321 3834 1329 1628 2030 1932
# Sample from pars to calculate some new measures
pars_sample <- pars[sample_indices, c("mu", "sigma")]

# Visualize the mean IQ
hist(pars_sample$mu)

# Calculate quantiles
quantile(pars_sample$mu, c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 34.84848 42.42424 50.00000

But how smart will the next zombie be?

So we have an idea about what the mean zombie IQ is but what range of zombie IQs should we expect? And how likely is it that the next zombie you encounter is, at least, moderately intelligent?

head(pars_sample)
##            mu     sigma
## 3321 30.30303 16.733333
## 3834 50.00000 19.253535
## 1329 42.42424  6.652525
## 1628 40.90909  8.164646
## 2030 43.93939 10.180808
## 1932 46.96970  9.676768
pred_iq <- rnorm(10000, mean = pars_sample$mu, 
                 sd = pars_sample$sigma)

# Visualize pred_iq
hist(pred_iq)

# Calculate the probability of a zombie being "smart" (+60 IQ)
mean(pred_iq >= 60)
## [1] 0.0872

Zombies with an IQ of 60 or more are of moderate intelligence, and much more dangerous! (They can open doors!)

The risk is relatively low but still very real. I always carry my zombie repellent spray!

A practical tool: BEST

Low values for df increases outliers in data

The BEST models and zombies on a diet

The t-test is a classical statistical procedure used to compare the means of two data sets. In 2013 John Kruschke developed a souped-up Bayesian version of the t-test he named BEST (standing for Bayesian Estimation Supersedes the t-test). Let’s try out BEST as implemented in the BEST package.

library(BEST)
## Loading required package: HDInterval
# The IQ of zombies on a regular diet and a brain based diet.
iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46)

# Calculate the mean difference in IQ between the two groups
mean(iq_brains) - mean(iq_regular)
## [1] 9.3
# Fit the BEST model to the data from both groups
best_posterior <- BESTmcmc(iq_brains, iq_regular)
## Waiting for parallel processing to complete...
## done.
# Plot the model result
plot(best_posterior)

This plot shows the posterior probability distribution over the difference in means between iq_brains and iq_regular. On top of this you get:

  1. The mean of the posterior as a “best guess” for the difference.

  2. A 95% credible interval (called a 95% Highest Density Interval in the plot).

  3. The amount of probability below and above zero difference.

There is some evidence that eating brains makes zombies smarter, but it’s uncertain by how much.

BEST is robust

The Bayesian model behind BEST assumes that the generative model for the data is a t-distribution; a more flexible distribution than the normal distribution as it assumes that data points might be outliers to some degree. This makes BEST’s estimate of the mean difference robust to outliers in the data.

# The IQ of zombies given a regular diet and a brain based diet.
iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150)

# Modify the data to include an outlier and calculate the difference in means
mean(iq_brains) - mean(iq_regular)
## [1] -1.1
# Fit the BEST model to the data from both groups
best_posterior <- BESTmcmc(iq_brains, iq_regular)
## Waiting for parallel processing to complete...done.
# Plot the model result
plot(best_posterior)

Looking at the plot, we see that the mutant zombie data point has made BEST more uncertain to some degree. But since BEST is robust to outliers, it still estimates that brain-eating zombies are more likely to have a higher IQ than zombies on a regular diet.

There is weak evidence that eating brains make zombies smarter. And we should be better at screening for mutant zombies when doing experiments.

What have you learned? What did we miss?